Significance of Invariant Acoustic Cues in a Probabilistic Framework for Landmark-based Speech Recognition

Author

  • Amit Juneja
Abstract

A probabilistic framework for landmark-based speech recognition is presented that utilizes the sufficiency and context-invariance properties of acoustic cues for phonetic features. Binary classifiers of the manner phonetic features "sonorant", "continuant" and "syllabic" operate on each frame of speech, each using a small number of relevant and sufficient acoustic parameters to generate probabilistic landmark sequences. The parameters developed for extracting acoustic cues for manner phonetic features are relative measures, which makes them "invariant" to the manner of neighboring speech frames. Because of this invariance, those three classifiers together with the speech/silence classifier form a complete set irrespective of the manner context. The obtained landmarks are then used to extract relevant acoustic cues for probabilistic binary decisions on the place and voicing phonetic features. As with the manner acoustic cues, the acoustic cues for place phonetic features extracted using manner landmarks are invariant to the place of neighboring sounds. Pronunciation models based on phonetic features are used to constrain the landmark sequences and to narrow the classification of place and voicing. Preliminary results have been obtained for manner recognition and the corresponding landmarks. Using classifiers trained on the phonetically rich TIMIT database, 80.2% accuracy was obtained for broad class recognition of the isolated digits in the TIDIGITS database, which compares well with the accuracies of 74.8% and 81.0% obtained by a hidden Markov model (HMM) based system using mel-frequency cepstral coefficients (MFCCs) and knowledge-based parameters, respectively.

INTRODUCTION

A probabilistic framework is presented for a landmark-based approach to speech recognition in which speech sounds are represented by bundles of binary-valued phonetic features (Chomsky and Halle, 1968).
The framework exploits two properties of the developed acoustic cues of distinctive features: sufficiency and invariance. Sufficiency means that a small number of acoustic parameters (APs) targeting the acoustic correlates of a phonetic feature are enough for a probabilistic decision on that feature, so the framework uses only those APs. The APs for a phonetic feature are also assumed (and verified in this work) to be invariant to context; for example, the APs for the feature sonorant are assumed to be invariant to whether the sonorant frame is in a vowel, nasal or semivowel context. Similarly, the APs for the place feature alveolar of stop consonants are assumed to be independent of the vowel context (Stevens, 1999). In this paper, it is shown how the framework utilizes these two properties of the APs. Although the APs may not strictly possess these properties, it is shown using the APs for one of the phonetic features that the properties may be approximately correct.

Juneja et al. Significance of Invariant Acoustic Cues From Sound to Sense: June 11 – June 13, 2004 at MIT C-152

In the enhancements to the event-based system (EBS) (Espy-Wilson, 1994; Bitar, 1997) presented in this article, probabilistic landmark sequences related to manner phonetic features are located in the speech signal. The landmarks are then analyzed for place and voicing phonetic features, resulting in a complete set of features for the description of a word or a sentence. The lexicon can then be accessed using the representation of words as sequences of bundles of phonetic features.

By selectively using knowledge-based APs and conducting landmark-based analysis, EBS offers a number of advantages. First, the analysis at different landmarks can be carried out by different procedures; e.g., a higher resolution can be used in the analysis of burst landmarks of stop consonants than for the syllabic peak landmarks of vowels.
Second, a different set of measurements can be used at different landmarks for the analysis of the place features. Third, the selective analysis makes it easy to pinpoint the exact source of errors in the recognition system.

The presented probabilistic framework for EBS is similar to the SUMMIT framework (Halberstadt, 1998) in that both systems carry out multiple segmentations and then use these segmentations for further analysis. Unlike the SUMMIT system, EBS is strictly based on acoustic landmarks and articulatory features and uses the idea of minimal knowledge-based acoustic correlates of phonetic features. EBS requires only binary classifiers, each operating on a fixed number of parameters; these parameters are obtained from each frame of speech for manner classification and from specific landmarks for place and voicing classification. Support Vector Machines (SVMs) (Vapnik, 1995) were chosen for classification because they have been shown to be effective for distinctive feature detection in speech (Niyogi, 1998).

METHOD

The probabilistic framework presented in this section assumes the sufficiency and invariance properties of the APs. The validity of these assumptions is then discussed in the next section. The problem of speech recognition can be expressed as maximization of the posterior probability $P(LU \mid O) = P(L \mid O)\,P(U \mid LO)$ of the landmark sequence $L = \{l_i\}_{i=1}^{M}$ and the sequence of feature bundles $U = \{U_i\}_{i=1}^{N}$, given the observation sequence $O$. The meaning of these symbols is explained in Figure 1(a), which shows their canonical values for the word "one". Landmarks can be associated with five broad classes, namely vowel (V), fricative (Fr), sonorant consonant (SC), stop burst (ST) and silence (SIL), as shown in Figure 1(b). The sequence of landmarks for an utterance can be completely determined by its broad class sequence.
Therefore, $P(L \mid O) = P(B \mid O)$, where $B = \{B_i\}_{i=1}^{M}$ is a sequence of broad classes for which the landmark sequence $L$ is obtained. Note that there is no temporal information contained in $B$, $L$ and $U$; these are only sequences of symbols. Given a sequence $O = \{o_1, o_2, \ldots, o_T\}$ of $T$ frames, where $o_t$ is the vector of all APs at time $t$, the broad class segmentation problem can be stated as a maximization of $P(BD \mid O)$ over all $B$ and $D$. EBS does not use all of the APs at each frame; however, all of the APs are assumed to be available so as to develop the probabilistic framework. EBS uses the probabilistic phonetic feature hierarchy in Figure 1(b) to segment speech into the four broad classes and silence. The concept of probabilistic hierarchies has appeared before with application to phonetic classification (Halberstadt, 1998), but, to the best of our knowledge, it has not been used as a uniform framework for landmark detection and phonetic classification.

Calculation of $P(BD \mid O)$ for all $D$ is computationally very intensive. Therefore, $P(B \mid O)$ is approximated as $\max_D P(BD \mid O)$. This approximation is similar to the one made by the Viterbi algorithm in HMM decoding as well as by the SUMMIT system (Halberstadt, 1998). Using the probabilistic hierarchy, the posterior probability of a frame being part of a vowel at time $t$ can be expressed as (using $P_t$ to denote the posterior probability of a feature or a set of features at time $t$)

$P_t(V \mid O) = P_t(\text{speech}, \text{sonorant}, \text{syllabic} \mid O) = P_t(\text{speech} \mid O)\, P_t(\text{sonorant} \mid \text{speech}, O)\, P_t(\text{syllabic} \mid \text{speech}, \text{sonorant}, O)$
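The hierarchy-based factorization and the max-over-D approximation described above can be illustrated with a small Python sketch. It is not the authors' implementation. Only the vowel branch (speech, sonorant, syllabic) is stated in the text; the branches for SC, Fr, ST and SIL are assumptions, since Figure 1(b) is not reproduced here. Reading D as per-segment durations and scoring a segmentation as a product of per-frame posteriors are likewise simplifying assumptions, and the function names are hypothetical.

```python
import math


def broad_class_posteriors(p_speech, p_sonorant, p_syllabic, p_continuant):
    """Chain-rule posteriors for one frame from four binary classifier outputs.

    Each argument is the probability that the feature is positive, conditioned
    on its parents in the hierarchy. Only the V branch is taken from the text;
    the other four branches are an assumed hierarchy.
    """
    return {
        "V": p_speech * p_sonorant * p_syllabic,
        "SC": p_speech * p_sonorant * (1.0 - p_syllabic),
        "Fr": p_speech * (1.0 - p_sonorant) * p_continuant,
        "ST": p_speech * (1.0 - p_sonorant) * (1.0 - p_continuant),
        "SIL": 1.0 - p_speech,
    }


def best_segmentation(frame_posts, classes):
    """Approximate the score of class sequence B by its best segmentation.

    frame_posts: one posterior dict per frame (see broad_class_posteriors).
    classes: the broad class sequence B, e.g. ["V", "SIL"].
    Returns (log_score, durations) with one duration (in frames) per class.
    Assumes frames are scored independently, which the source does not state.
    """
    T, M = len(frame_posts), len(classes)
    neg_inf = float("-inf")
    # score[m][t]: best log score with the first t frames covered by the first m classes
    score = [[neg_inf] * (T + 1) for _ in range(M + 1)]
    entered = [[False] * (T + 1) for _ in range(M + 1)]
    score[0][0] = 0.0
    for m in range(1, M + 1):
        for t in range(m, T + 1):
            stay = score[m][t - 1]        # segment m continues
            enter = score[m - 1][t - 1]   # segment m starts at this frame
            entered[m][t] = enter > stay
            p = max(frame_posts[t - 1][classes[m - 1]], 1e-300)  # avoid log(0)
            score[m][t] = max(stay, enter) + math.log(p)
    # Backtrace to recover the duration of each segment
    durations = [0] * M
    m, t = M, T
    while t > 0:
        durations[m - 1] += 1
        if entered[m][t]:
            m -= 1
        t -= 1
    return score[M][T], durations
```

For example, two vowel-like frames followed by two silence-like frames, scored against the sequence V, SIL, should split into two segments of two frames each. The dynamic program plays the same role as the Viterbi-style maximization mentioned in the text: it keeps only the best segmentation instead of summing over all of them.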


Similar resources

A probabilistic framework for landmark detection based on phonetic features for automatic speech recognition.

A probabilistic framework for a landmark-based approach to speech recognition is presented for obtaining multiple landmark sequences in continuous speech. The landmark detection module uses as input acoustic parameters (APs) that capture the acoustic correlates of some of the manner-based phonetic features. The landmarks include stop bursts, vowel onsets, syllabic peaks and dips, fricative onse...


Persian Phone Recognition Using Acoustic Landmarks and Neural Network-based variability compensation methods

Speech recognition is a subfield of artificial intelligence that develops technologies to convert speech utterance into transcription. So far, various methods such as hidden Markov models and artificial neural networks have been used to develop speech recognition systems. In most of these systems, the speech signal frames are processed uniformly, while the information is not evenly distributed ...


Title of dissertation: SPEECH RECOGNITION BASED ON PHONETIC FEATURES AND ACOUSTIC LANDMARKS

Title of dissertation: SPEECH RECOGNITION BASED ON PHONETIC FEATURES AND ACOUSTIC LANDMARKS. Amit Juneja, Doctor of Philosophy, 2004. Dissertation directed by: Carol Espy-Wilson, Department of Electrical and Computer Engineering. A probabilistic and statistical framework is presented for automatic speech recognition based on a phonetic feature representation of speech sounds. In this acoustic-phone...


Binaural Cues for Fragment-Based Speech Recognition in Reverberant Multisource Environments

This paper addresses the problem of speech recognition using distant binaural microphones in reverberant multisource noise conditions. Our scheme employs a two-stage fragment decoding approach: first, spectro-temporal acoustic source fragments are identified using signal-level cues, and second, a hypothesis-driven stage simultaneously searches for the most probable speech/background fragment labe...


Landmark-based approach to speech recognition: an alternative to HMMs

In this paper, we compare a Probabilistic Landmark-Based speech recognition System (LBS), which uses Knowledge-based Acoustic Parameters (APs) as the front end, with an HMM-based recognition system that uses the Mel-Frequency Cepstral Coefficients as its front end. The advantages of LBS based on APs are (1) the APs are normalized for extra-linguistic information, (2) acoustic analysis at different...



Publication date: 2004